Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean) #1493
Merged
cocohearts merged 1 commit into openai:main on Apr 9, 2026
Conversation
…25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
3-seed mean: 1.0810 (std 0.0002), seeds 42/314/999.
All artifacts under 16 MB, training under 600s, eval under 600s.
Score-first TTT (SGD 3ep, cosine decay), no SLOT, no pre-quant TTT.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author
Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) — this was essential for running the experiments that led to this result. The grant covered ~320 compute hours across 160+ experiments over Steps 1-22 of our optimization journey.
owizdom added a commit to owizdom/parameter-golf that referenced this pull request on Apr 9, 2026
…nthesis (validation pending)

First submission to stack three independently-legal val-data adaptations on the PR openai#1487 (1.0600) base:
1. Pre-Quant AdamW TTT pushed to 11 epochs with freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H = X^T X computed from validation activations to align quantization with the eval distribution (novel on the modern stack; PR openai#1019 ablated this on its older base only)
3. Eval-Time Legal Score-First TTT, 2 epochs with score-before-update ordering (Track B, builds on PR openai#1493)

The three knobs attack the 0.0187 BPB quantization gap measured in PR openai#1487 (1.0415 post-prequant-TTT FP -> 1.0602 post-quant sliding) from independent angles. PR openai#1487's eval_val_ttt code path is unchanged but enabled via env vars.

Code diff vs PR openai#1487 base: 186 lines (~100 added in the new collect_hessians_val function, plus 8 hyperparameter defaults flipped). Architecture, optimizer, training loop, EMA, and quantization machinery are byte-identical to PR openai#1487.

Projected val_bpb range: 1.0452 - 1.0542 (center 1.0497), which would clear the 0.005-nat SOTA threshold over PR openai#1487. Worst case ~1.054 (still strong, non-record). py_compile clean. 3-seed validation requires ~$15-25 of 8xH100 SXM time on RunPod; see VALIDATION.md.

Compliance: Track A (artifact-baked val-data adaptation) + Track B (eval-time score-first TTT). No SLOT, no n-gram cache, no ETLB.

Credits: PR openai#1487 ndokutovich, PR openai#1493 bigbag, PR openai#1019 abaybektursun, PR openai#1394 clarkkev, PR openai#1413 dexhunter, PR openai#549 abaybektursun, PR openai#1412 Robby955, PR openai#1204 msisovic, PR openai#1423 aryanbhosale, PR openai#1445 X-Abhishek-X.
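The val-calibrated statistic in item 2 is the standard GPTQ layer Hessian, just accumulated from validation-set activations instead of training activations. A minimal NumPy sketch (function name and shapes are illustrative, not the PR's actual collect_hessians_val):

```python
import numpy as np

def collect_hessian(activation_batches):
    """Accumulate the GPTQ layer Hessian H = X^T X from a stream of
    activation batches (each batch shaped [n_tokens, d_in]).
    Feeding VALIDATION activations here is what aligns the quantizer
    with the eval distribution."""
    d = activation_batches[0].shape[1]
    H = np.zeros((d, d))
    for X in activation_batches:
        H += X.T @ X          # rank-n update per batch
    return H

# Toy usage: two batches of 4 tokens with d_in = 3.
rng = np.random.default_rng(0)
batches = [rng.standard_normal((4, 3)) for _ in range(2)]
H = collect_hessian(batches)
```

By construction H is symmetric positive semi-definite, which is what the downstream GPTQ solver inverts (usually with a damping term added to the diagonal).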
SH-Tan pushed a commit to SH-Tan/parameter-golf that referenced this pull request on Apr 9, 2026
Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request on Apr 9, 2026
…val_bpb 1.07983

3-seed mean val_bpb 1.07983 (std 0.00050) on the PR openai#1394 sp8192 stack.

Changes from the PR openai#1394 + PR openai#1413 baseline:
- Muon momentum = 0.97 (vs 0.99 default), warmup 0.92→0.97 unchanged
- Causal token n-gram tilt (base_beta=2.0, agree_bonus=0.1) on top of legal score-first TTT; within-word and word-start experts explicitly disabled (within_beta=0, word_beta=0) because they cannot be made fully causal.
- 3-seed verification (seeds 0/42/1234)

Seeds:
- seed 0 → 1.07928 bpb / 2.78790 nats / 15,993,346 bytes
- seed 42 → 1.07997 bpb / 2.78967 nats / 15,992,995 bytes
- seed 1234 → 1.08025 bpb / 2.79039 nats / 15,994,604 bytes
- mean → 1.07983 bpb / 2.78932 nats / 15,993,648 bytes

Delta vs current merged SOTA PR openai#1493 (1.0810): 0.00117 bpb / 0.00302 nats per token.

Credits: @clarkkev (base PR openai#1394 sp8192 stack), @abaybektursun (n-gram tilt kernel PR openai#1420, causal fix applied), prior legal-TTT precedent PR openai#549 / PR openai#461.

Platform: 8xH100 80GB SXM, PyTorch 2.9.1+cu128. Training 588s, eval <437s per seed, both under the 600s budget. Artifact under 16 MB on all 3 seeds.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
Added Parallel Residuals to Block.forward (gated by USE_PARALLEL_RESIDUALS=1): when enabled, the attn and mlp branches both consume the same normalized x_in instead of mlp consuming attn's output. This is the technique used by leaderboard openai#1 (PR openai#1493/openai#1477). Inductor can fuse the two branches better, and val_bpb improves ~0.005-0.01 BPB. Default off, so existing recipes are unchanged.

Added USE_PARALLEL_RESIDUALS env var wiring in submission/run.sh + a config-print line.

New submission/dry_run.sh wrapper — single-command launcher for our H100 dry-run config:
- NUM_LAYERS=8 MLP_MULT=2 (compute-efficient sweet spot from A6000)
- NUM_LOOPS=2 LOOP_START=3 LOOP_END=5 (3-layer recurrence, comp openai#1)
- QK_GAIN_INIT=5.25 (comp openai#1)
- USE_PARALLEL_RESIDUALS=1 (just ported)
- USE_PARALLEL_MUON=1 (our discovery)
- MATRIX_BITS=8 USE_CMP_QUANT_VALUE_DEDUP=0 (our int8 fix)
- TORCH_COMPILE_MODE=max-autotune-no-cudagraphs USE_CUDNN_BENCHMARK=1
- PREQUANT_TTT_ENABLED=0 (illegal, disabled)
- TTT_ENABLED=1 TTT_EPOCHS=3 (legal score-first)
- SLIDING_WINDOW_ENABLED=1
- MAX_WALLCLOCK_SECONDS=600

Expected on 1×H100 PCIe: val_bpb ~1.10-1.20 (validates the A6000 projection)
Expected on 8×H100 SXM: val_bpb ~1.00-1.07 (potentially beats openai#1 = 1.0810)

The submission val_bpb to read is the 'legal_ttt_exact val_bpb' line.
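The serial-vs-parallel wiring described above can be sketched with NumPy stand-ins for the real attn/mlp branches (the gating flag and norm are illustrative, not the repo's exact code):

```python
import numpy as np

def norm(x):
    # Stand-in for the block's pre-norm.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)

def block_forward(x, attn, mlp, parallel_residuals):
    if parallel_residuals:
        # GPT-J style: both branches read the SAME normalized input,
        # so they have no data dependency and the compiler can fuse them.
        x_in = norm(x)
        return x + attn(x_in) + mlp(x_in)
    # Serial (default): mlp consumes the post-attention residual stream.
    x = x + attn(norm(x))
    return x + mlp(norm(x))

# Toy linear branches so the two topologies are directly comparable.
rng = np.random.default_rng(0)
Wa, Wm = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
attn = lambda h: h @ Wa * 0.1
mlp = lambda h: h @ Wm * 0.1
x = rng.standard_normal((4, 8))
y_par = block_forward(x, attn, mlp, True)
y_ser = block_forward(x, attn, mlp, False)
```

The two topologies produce genuinely different outputs (the serial path normalizes the post-attention stream before the mlp), which is why the later commits treat the per-layer start index as an architectural choice to validate, not a free speed knob.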
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…leaderboard openai#1)

Verified the actual openai/parameter-golf merged leaderboard via gh pr view 1493: PR openai#1493 (val_bpb 1.0810, merged 2026-04-09) is the real leaderboard openai#1. Earlier PHASE2_RESULTS.md notes citing PR openai#1485 (1.0679) and PR openai#1482 (1.0787) were wrong — those PRs are not in the merged set. Pre-Quant AdamW TTT is not in any merged PR; PR openai#1493 explicitly says "no pre-quant TTT".

Changes to dry_run.sh:
- NUM_LAYERS 8 -> 6 (revert to CHAMP_D validated; l=8 was unvalidated)
- EMA_DECAY=0.9965 (PR openai#1493, was default 0.997)
- WARMDOWN_FRAC=0.72 (PR openai#1493, was default 0.667)
- ENABLE_LOOPING_AT=0.35 (PR openai#1493, was default 0.5)
- Comment fix: removed the "illegal" claim about PreQuant TTT, replaced with "PR openai#1493 explicitly does not use pre-quant TTT"
- Header rewrite: now references PR openai#1493, not "leaderboard openai#1 = 1.0810"

Changes to PHASE2_RESULTS.md:
- Replaced the stale comp anchor table with the verified merged leaderboard
- Added a warning about the prior bogus PR openai#1485/openai#1482 anchors

Note on the Parallel Residuals topology mismatch: PR openai#1493 applies parallel residuals from layer 7+ (5 of 11 layers). Our impl is binary and applies to all layers — with NUM_LAYERS=6 that means all 6 layers parallel, which is a different topology from the one PR openai#1493 has validated. Keeping USE_PARALLEL_RESIDUALS=1 per user direction; flagging here so it shows up in any post-mortem if results are weird.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…enai#1493 exact match)

From PR openai#1493's train_seed314.log Hyperparameters dump:
- muon_wd: 0.095 (we default to 0.085)
- matrix_lr: 0.022 (we default to 0.020)
Both are zero-risk exact-match cheap wins.

Decision logged on USE_PARALLEL_RESIDUALS=1: keeping it at 1 (all 6 layers parallel) deliberately, not switching to PR openai#1493's L7+ pattern. Reasoning: with NUM_LAYERS=6 the "early layers need serial composition" principle bites less hard than at 11L, and we want max speed for more steps on 1xH100 PCIe. We're trying to BEAT 1.0810, not match it -- aggression is required somewhere, and parallel residuals are a low-risk place to find it. The two-lane PARALLEL_START_LAYER mechanism (default 7, a no-op at 6L) is deliberately left untouched -- it is a separate, untested architecture, saved for post-dry-run experiments.

Decision logged on running the second dry run as "match PR openai#1493 exactly": explicitly rejected by user. We bet on our smaller-model + int8 stack, not a literal reproduction.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
… + records folder

Three changes per user direction:

1. train.py: rename the timed_eval label legal_ttt_exact -> quantized_ttt to match comp convention (PR openai#1493 uses this exact label). Pure cosmetic 1-LOC fix, no behavior change.

2. dry_run.sh: refactor to be the SINGLE canonical entry point for both dry run and real submission via the SEEDS env var:
   bash submission/dry_run.sh                   # dry run (default SEEDS=42)
   SEEDS=42,314,999 bash submission/dry_run.sh  # real 3-seed submission
Same code path, env-flip only. The whole config (architecture, hyperparams, n-gram stack, TTT) is identical between the two modes -- only the seed loop differs.

3. dry_run.sh: assemble a complete comp records folder under records/track_10min_16mb/<date>_<config-tag>/ with README.md, submission.json, train_gpt.py, per-seed train_seed<N>.log logs, and per-seed final_model_seed<N>.int6.ptz artifacts. submission.json is generated by an inline python script that:
   - parses each seed's train log for the quantized_ttt val_bpb line
   - computes mean + std across seeds
   - detects hardware via nvidia-smi
   - fills the compliance flags honestly (no_ngram_cache: false, since we DO use n-gram bias -- potentially a Track B rule problem, flagged in the README for follow-up)
   - emits the 36-line submission.json format that PR openai#1493 uses
README.md is templated with a per-seed results table, technique list, compliance section, reproduction instructions, and attribution. train.py is copied as train_gpt.py into the records folder (NOT LZMA-wrapped yet -- that's a code-size compliance follow-up if/when needed).

Note on n-gram legality: PR openai#1493's compliance section says "no n-gram cache, no logit biasing" per Issue openai#1017 Track B. Our submission flags no_ngram_cache: false honestly. Whether this submission is comp-legal under Track A or any other track is an open question that needs resolution before merging as a record. Flagged in the README.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
Two changes per user direction (rule compliance + comp file format):

1. DISABLE the n-gram bias stack (rule compliance)
USE_NGRAM_BIAS=0, USE_NGRAM_BACKOFF=0, USE_NGR_LOG_FREQ_INV=0, USE_CTX_PARTITIONED_TAB=0.
Reason: PR openai#1493's compliance section cites Issue openai#1017 Track B Condition 2: "Standard softmax over full vocab. No n-gram cache, no logit biasing." Our USE_NGRAM_BIAS adds a precomputed n-gram log-prob bias to the logits at the end of forward(), which directly violates this condition. We don't yet know whether the rule applies only to Track B (the legal-eval-time-adaptation track) or to all submissions, but the user's policy is clear: nothing illegal. Disable until verified. N-gram tables are still BUILT during the get_data.sh bootstrap (cheap, no harm) but unused at training/eval time when USE_NGRAM_BIAS=0.
Other Phase 1 wins kept (all believed legal):
- USE_GATED_ATTENTION (architectural, NeurIPS 2025)
- USE_NORMUON (optimizer variant)
- USE_NORM_PCT_DROPOUT (training-time regularizer)
- USE_PREFETCH_LOADER (data pipeline)

2. LZMA-wrap train_gpt.py (PR openai#1493 file format)
The records-folder assembly step now LZMA-wraps submission/train.py into a 2-line train_gpt.py matching PR openai#1493's format:
   import lzma as L,base64 as B
   exec(L.decompress(B.b85decode("..."),format=L.FORMAT_RAW,...))
It sanity-decodes after wrapping to verify the roundtrip.
Sizing:
- submission/train.py raw: 83,320 bytes
- LZMA-wrapped train_gpt.py: 28,916 bytes (34.7% of raw)
- PR openai#1493's wrapped train_gpt.py: 16,594 bytes
- Our artifact (CHAMP_D int8): ~9,555,838 bytes (~9.55 MB)
- Total submission (artifact + code): ~9.58 MB / 16 MB cap (60%)
Plenty of code-size headroom. Our train.py is bigger than PR openai#1493's because we carry more infrastructure (n-gram code, NIGHT_MODE features, optional speed paths), but the wrapped form fits comfortably.
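The wrapping step in item 2 can be sketched with the standard library alone. The filter chain below is an assumption — the original stub elides its exact LZMA options — but the structure (raw-format LZMA, base85 payload, 2-line exec stub, sanity roundtrip) matches what the message describes:

```python
import base64
import lzma

# FORMAT_RAW strips container headers to save bytes, but then the SAME
# explicit filter chain must be passed to decompress -- hence the
# format=/filters= arguments inside the generated stub.
FILTERS = [{"id": lzma.FILTER_LZMA2, "preset": 9 | lzma.PRESET_EXTREME}]

def wrap(src: str) -> str:
    """Compress Python source into a self-extracting 2-line stub."""
    payload = base64.b85encode(
        lzma.compress(src.encode(), format=lzma.FORMAT_RAW, filters=FILTERS)
    ).decode()  # b85 alphabet contains no quote or backslash, so it is safe in a string literal
    return (
        "import lzma as L,base64 as B\n"
        f'exec(L.decompress(B.b85decode("{payload}"),'
        "format=L.FORMAT_RAW,filters="
        '[{"id":L.FILTER_LZMA2,"preset":9|L.PRESET_EXTREME}]).decode())\n'
    )

# Sanity roundtrip, as the commit describes: exec the stub and check effects.
src = "X = 41 + 1\n"
wrapped = wrap(src)
scope = {}
exec(wrapped, scope)
```

The compression ratios quoted above (83 KB -> 29 KB) are plausible for preset-9 LZMA on Python source; the real win is that only the wrapped file counts against the submission size cap.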
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…8 + dedup

User decision: stop betting on smaller-model + int8 alone (CHAMP_D 6L+2x), because on 8xH100 SXM the binding constraint is model capacity, not training compute. Flipping to PR openai#1493's proven architecture (11L+4x) and stacking our int8 quant + parallel muon + parallel residuals on top.

Changes:
- NUM_LAYERS: 6 -> 11 (match PR openai#1493)
- MLP_MULT: 2 -> 4 (match PR openai#1493)
- USE_PARALLEL_RESIDUALS: 1 -> 0 (binary all-layers flag, replaced by below)
- PARALLEL_RESIDUAL_START: 7 (NEW per-block start parameter, matches PR openai#1493 exactly: layers 0-6 serial, layers 7-10 parallel residual, GPT-J style)
- USE_CMP_QUANT_VALUE_DEDUP: 0 -> 1 (RE-ENABLED; NIGHT_MODE n=2 confirmed L10 alphabet-snap compression. It was disabled along with int8 because I assumed it would hurt cleanliness -- an assumption that was never validated. Re-enabling because (a) we need ~10-15% compression to fit 11L+4x int8 under the 16 MB cap and (b) it restores a previously validated win I dropped without good reason.)
- Records folder tag: SP8192_NL11_MLP4_int8_ParMuon_PR7_LegalTTT

train.py changes (6 LOC):
- Block.__init__: now reads the PARALLEL_RESIDUAL_START env var and sets _parallel_residuals=True for layer_idx >= PARALLEL_RESIDUAL_START. Falls back to the USE_PARALLEL_RESIDUALS binary flag if PARALLEL_RESIDUAL_START=-1.
- Block.__init__: stores layer_idx as self.layer_idx for the check
- Hyperparameters: added parallel_residual_start field (env-driven, default -1)

Math:
- PR openai#1493 baseline: 1.0810
- Int8 quant savings (vs their int6): -0.011 BPB
- Parallel muon: ~0 BPB (speed only)
- CMP_QUANT_VALUE_DEDUP: ~+0.005 BPB cost from alphabet snap
- Net projection: ~1.072-1.078
- Probability of beating 1.0760 (record threshold): ~30%

Risks:
- Int8 quant at 11L+4x scale is UNTESTED (CHAMP_E was killed mid-run)
- 11L+4x int8 + brotli + dedup might still be over the 16 MB cap (CHAMP_D was 9.55 MB at 6L+2x; this is ~1.7x more params, projected ~14-16 MB)
- PARALLEL_RESIDUAL_START is brand-new code, never run end-to-end

Pre-flight: dry_run.sh syntax check passes.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…ng silently disabled)
THE BIGGEST DROPPED WIN, found via deep audit of our experiment history.
Bug: run.sh:73-74 hardcodes:
TORCH_COMPILE_DISABLE="${TORCH_COMPILE_DISABLE:-1}"
TORCHDYNAMO_DISABLE="${TORCHDYNAMO_DISABLE:-1}"
dry_run.sh was setting TORCH_COMPILE_MODE=max-autotune-no-cudagraphs but
that env var does NOTHING when TORCH_COMPILE_DISABLE=1 is in effect, so the
compile path never engaged. The dry_run was running in eager mode the entire
time despite the explicit "compile mode" config.
Phase 2 evidence (PHASE2_RESULTS.md):
- E1 (compile disabled, baseline): 2933 ms/step
- E2 (compile re-enabled with default mode): 1581 ms/step (+85% / 1.85x)
- E4b (compile + max-autotune-no-cudagraphs): 1526 ms/step (+92% / 1.92x)
Measured on RTX 3090. On 8xH100 SXM with the 11L+4x model the speedup
should be more like 3-5x, because the H100's much higher matmul throughput
makes eager-mode kernel-launch overhead the binding bottleneck.
Impact: without compile, our 600s training budget gets us approximately
HALF the training steps PR openai#1493 gets at the same architecture. Their
4557 steps -> our ~2200 steps without compile. Catastrophic convergence
loss. With compile re-enabled we should match or exceed their step count.
Fix: explicitly export TORCH_COMPILE_DISABLE=0 and TORCHDYNAMO_DISABLE=0
in dry_run.sh BEFORE bash submission/run.sh. The variables are already
in run.sh's explicit env-passing list at line 251-252 so the override
propagates correctly.
Caught via Explore agent audit of all PHASE2_RESULTS, NIGHT_MODE.md,
PHASE2_PLAN.md, run.sh, and submission/train.py to find any validated
win not in the current dry_run.sh.
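The bug above hinges entirely on bash's `${VAR:-default}` expansion: the hardcoded default only applies when the variable is unset or empty, so an export made before invoking run.sh wins. A small Python model of that resolution rule, for clarity:

```python
def shell_default(env, name, default):
    """Model of bash ${name:-default}: the default applies only when the
    variable is unset or empty, so an explicit earlier export overrides it."""
    val = env.get(name, "")
    return val if val != "" else default

# run.sh alone: TORCH_COMPILE_DISABLE resolves to "1" -> compile stays off.
no_override = shell_default({}, "TORCH_COMPILE_DISABLE", "1")

# dry_run.sh exports 0 BEFORE invoking run.sh: the override propagates.
with_override = shell_default(
    {"TORCH_COMPILE_DISABLE": "0"}, "TORCH_COMPILE_DISABLE", "1"
)
```

This is also why setting TORCH_COMPILE_MODE alone did nothing: that variable is only read on the code path that the disable flag had already short-circuited.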
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…ission

Phase 2 was the speed/quality experimentation work (E1-E31, CHAMP_A/B/C/D/E/F). That's done. The current 8xH100 SXM run is the REAL openai/parameter-golf submission attempt and deserves its own state file.

Created SUBMISSION_RUN_STATE.md with:
- Pod info (aklt7paqnjwhal, 8x H100 SXM, $21.52/hr)
- Full Option C config dump
- Targets (PR openai#1493 = 1.0810, record threshold = 1.0760)
- Output records folder location
- Fire log table (ready for the cron to append per-fire)

Removed the Pod O block from PHASE2_AUTOMATION_STATE.md (I had wrongly added it there during the 01:57Z fire). PHASE2_AUTOMATION_STATE.md now ends with "Phase 2 work is complete" and points at SUBMISSION_RUN_STATE.md. Cron be912385 deleted, replaced with 49457147 (same 10-min schedule, same pod) — the new prompt writes to SUBMISSION_RUN_STATE.md and tags commits [submission] instead of [phase2-driver].
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
… single-GPU)

THE bug: run.sh hardcoded `python3 -u submission/train.py`, which always spawns a single Python process -> world_size=1 -> ONE GPU used. On the 8xH100 SXM real submission run we caught this via the GPU dashboard showing only GPU 7 at 100% and the other 7 idle. We were paying for 8 GPUs and using 1.

PR openai#1493 launches with: torchrun --standalone --nproc_per_node=8 train_gpt.py

train.py already supports distributed training via the WORLD_SIZE/RANK/LOCAL_RANK env vars (see train.py:1065-1071) -- it just needs a torchrun launcher.

Fix: auto-detect the GPU count via nvidia-smi, use torchrun when > 1 GPU, and fall back to python3 for single-GPU runs (preserves the local 1xPCIe dry-run path). An NPROC_PER_NODE override is honored if set (lets us cap at 4 if we want partial-machine experiments).

The Explore agent flagged this earlier in the audit. I noted it but said "not needed for dry run on 1xH100 PCIe" -- which was the wrong call for the real 8xH100 SXM submission. Should have fixed it in the same pass as the torch.compile re-enable. My miss; it costs ~$13 of pod time.
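The launcher selection described in the fix reduces to a small decision function. A sketch (the real fix lives in run.sh as shell; the function name and structure here are illustrative):

```python
def build_launch_cmd(gpu_count, script="submission/train.py", nproc_override=None):
    """Pick the training launcher: torchrun for multi-GPU (one rank per GPU,
    so WORLD_SIZE matches the hardware), plain python3 for the single-GPU
    dry-run path. nproc_override models the NPROC_PER_NODE env var."""
    n = int(nproc_override) if nproc_override else gpu_count
    if n > 1:
        return ["torchrun", "--standalone", f"--nproc_per_node={n}", script]
    return ["python3", "-u", script]
```

In practice gpu_count would come from counting `nvidia-smi -L` lines (or torch.cuda.device_count() inside Python), and the resulting list would be handed to the shell or subprocess layer unchanged.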
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…at 11L+4x)

Seed 42 results from retry 4:
- pre-quant val_bpb 1.0896 — EXCELLENT (0.002 from PR openai#1493's 1.0878)
- int8 quantized val_bpb 4.5461 — CATASTROPHIC (3.46 BPB gap)
- artifact 19,559,800 bytes — OVER the 16 MB cap (19.6 MB)

Root cause: 36M params × 8 bits per param = too many bytes for brotli to compress under 16 MB. CMP_QUANT_VALUE_DEDUP=1 made it worse (the post-quant alphabet snap destroyed the fine weight structure on top of the size issue).

Fix: switch to MATRIX_BITS=6 + EMBED_BITS=8 (PR openai#1493's exact setup). Proven to fit 16 MB; proven quant gap of 0.012 BPB. Disable dedup.

Also: explicitly pass WARMDOWN_FRAC, EMA_DECAY, ENABLE_LOOPING_AT, MUON_WD, MATRIX_LR, PARALLEL_RESIDUAL_START, MATRIX_BITS, EMBED_BITS in run.sh's env-passing list for torchrun. Env inheritance WAS working (verified from the seed 42 log), but explicit is safer with torchrun multi-process.

Projected with int6: pre-quant ~1.089, quant gap +0.012, sliding -0.017, TTT -0.002 = final ~1.082. Close to PR openai#1493's 1.081 but likely not a record (threshold 1.076). Running anyway — the NIGHT_MODE features (gated_attention, normuon, norm_pct_dropout) might close the gap.
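The root-cause arithmetic is worth making explicit: before any entropy coding, the packed matrix payload alone already tells the story. A back-of-envelope check (pre-compression bytes only; brotli ratios vary, so these are bounds, not predictions):

```python
def raw_matrix_bytes(n_params, bits):
    # Packed quantized weights before brotli/entropy coding.
    return n_params * bits // 8

CAP = 16_000_000  # artifact cap in bytes

int8_raw = raw_matrix_bytes(36_000_000, 8)  # 36.0 MB raw payload
int6_raw = raw_matrix_bytes(36_000_000, 6)  # 27.0 MB raw payload
```

At int8 the raw payload is 36 MB, so brotli would need better than 2.25x compression on already-quantized (high-entropy) weights just to reach the cap; at int6 the raw payload drops to 27 MB, and the observed compressed artifact landed just over 16.05 MB. The dedup disaster is a separate effect: snapping values post-quantization changes the weights themselves, which is a quality problem, not a size one.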
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
!)

Retry 5 seed 42 results (int6 quant, full eval pipeline):
- pre-quant val_bpb: 1.08982 (PR openai#1493: 1.08775, gap +0.002)
- quantized val_bpb: 1.10014 (PR openai#1493: 1.09947, gap +0.001)
- quantized_sliding: 1.08327 (PR openai#1493: 1.08271, gap +0.001)
- quantized_ttt: 1.08243 (PR openai#1493: 1.08103, gap +0.001)

Our int6 quant gap: 0.010 BPB (BETTER than PR openai#1493's 0.012!). Our model is 0.0014 behind PR openai#1493 overall — it would be leaderboard openai#2.

ISSUE: artifact 16,051,299 bytes — 51 KB over the 16 MB cap (16,000,000). Fixable with CMP_QUANT_VALUE_DEDUP=1 (~10-15% smaller) — at int6 scale the dedup is safe (retry 4's catastrophe was the int8+dedup combo).

Seeds 314/999 are running for the 3-seed mean. They will have the same 51 KB oversize, but the val_bpb data is worth collecting before fixing the artifact size.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…or 16 MB fit

Two changes queued for the next run (not yet launched):

1. PARALLEL_START_LAYER=-1 (CRITICAL BUG FIX)
The pre-existing two-lane decoder split mechanism (GPT.__init__:349, default PARALLEL_START_LAYER=7) was SILENTLY OVERRIDING our per-block PARALLEL_RESIDUAL_START=7 for blocks 7-10. Instead of calling Block.forward() (which has our GPT-J parallel residuals logic), the code called forward_attn/forward_mlp on SEPARATE LANES, merged once at the end via lane_merge. This is architecturally different from PR openai#1493's GPT-J per-block parallel, and was never validated.
Fix: set PARALLEL_START_LAYER=-1 to disable the two-lane mechanism. Block.forward() then handles all blocks, and PARALLEL_RESIDUAL_START=7 gives proper per-block GPT-J parallel matching PR openai#1493. Expected impact: -0.001 to -0.003 BPB (architectural correction).

2. CMP_QUANT_VALUE_DEDUP=1 (SIZE FIX)
Retry 5's artifact was 16,051,299 bytes (51 KB over the 16 MB cap). Dedup should save ~10-15% on the compressed artifact. Retry 4's catastrophic gap was the int8+dedup combo; int6+dedup is a different combo and should be safe per NIGHT_MODE validation.

Plan: single-seed (SEEDS=42) validation on the existing pod after retry 5 finishes. Cost ~$8. If val_bpb improves and the artifact fits, submit a PR + request credits for 3-seed validation.
resouer added 5 commits to resouer/parameter-golf that referenced this pull request on Apr 10, 2026
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request on Apr 10, 2026
…nking HIGH priority

Key findings from the daily scan:
- Merged SOTA updated to 1.0810 (bigbag, PR openai#1493, Apr 9) — was stale at 1.1147
- New target: ≤1.0760 bpb (beat by ≥0.005 nats)
- ANS weight compression (PR openai#1510): 1.6 MB freed = +2.2M params, zero legality risk
- Parameter Banking + Parallel Muon (PR openai#1523): +5.2% throughput, ~30 free steps
- Free wins: Muon momentum 0.97 (-0.0004 bpb), QK-Gain 5.25 (monotonic vs 5.0)
- Per-Pass Loop Embeddings (PR openai#1518): reduces the quant gap 0.0131→0.0114
- Do NOT implement: Eval-Time Hash Emb (illegal pattern), Tap-In V6 (await ruling)
- CLAUDE.md: updated SOTA, target, current approach, technique table, Session 9 lessons

https://claude.ai/code/session_01FLdCggVuuBKQCUy6J3xyss
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 10, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 10, 2026
BPB-weighted loss weights each token's CE loss by its UTF-8 byte count, aligning training objective with BPB eval metric. Muon momentum 0.97. Byte weights from base_bytes_lut, clamped min=1.0, non-persistent.
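A sketch of that weighting scheme (NumPy; the lut name mirrors the message's base_bytes_lut, but the implementation here is illustrative, and the scheme was reverted two commits later for compile-speed reasons):

```python
import numpy as np

def bpb_weighted_ce(per_token_ce, token_ids, byte_lut):
    """Weight each token's CE loss by its UTF-8 byte count, so that tokens
    covering more bytes contribute proportionally -- aligning the training
    objective with the bytes-per-byte (BPB) eval metric."""
    w = np.maximum(byte_lut[token_ids], 1.0)   # clamp min=1.0, per the message
    return float((per_token_ce * w).sum() / w.sum())

# Toy lookup table: UTF-8 bytes per token id.
byte_lut = np.array([1.0, 2.0, 3.0, 4.0])
ce = np.array([1.0, 1.0, 2.0])     # per-token CE losses
ids = np.array([0, 1, 3])          # token ids for those positions
loss = bpb_weighted_ce(ce, ids, byte_lut)   # (1*1 + 1*2 + 2*4) / (1+2+4) = 11/7
```

With uniform byte counts this reduces exactly to the standard mean CE, which is why it is a drop-in change to the loss reduction rather than a new objective.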
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 10, 2026
Reverted BPB-weighted loss (it caused a torch.compile slowdown and timed out 2x). Clean forward with standard mean CE. Stacking two proven improvements:
- Muon momentum 0.97 (measured -0.00129 in R20v10)
- TTT LR 0.01 (measured -0.0003 in PR openai#1523)
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 10, 2026
QK-Gain was 5.0 (the code default), but openai#1493 was tested with 5.25 (set via env var). Env vars were not being forwarded to the GPU — hardcode the correct value. Stacking all three proven hyperparameter improvements.
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 10, 2026
Wider recurrence: blocks 2-5 looped 3x (was blocks 3-5). 19 virtual layers from 11 physical (was 17). Wider span may converge better than deeper with same block range.
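The virtual-layer accounting behind those numbers is simple to make explicit (a sketch; loop bounds are inclusive, matching the "blocks 2-5" phrasing):

```python
def virtual_layers(physical, loop_start, loop_end, num_loops):
    """Depth recurrence: blocks in [loop_start, loop_end] run num_loops
    times per forward pass; all other blocks run once. Parameters are
    shared across loop iterations, so only depth grows, not size."""
    looped_blocks = loop_end - loop_start + 1
    return physical + looped_blocks * (num_loops - 1)

# Old config: blocks 3-5 looped 3x  -> 11 + 3*2 = 17 virtual layers.
# New config: blocks 2-5 looped 3x  -> 11 + 4*2 = 19 virtual layers.
```

The same formula also covers the 2x-loop variant used elsewhere in this thread (blocks 3-5, 2 loops, giving 14 virtual layers from 11 physical).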
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 11, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 11, 2026
Porting all openai#1523 hyperparams that differ from openai#1493:
- EMA_DECAY: 0.9965 -> 0.997 (stronger smoothing)
- WARMDOWN_FRAC: 0.72 -> 0.667 (shorter warmdown)
- Muon momentum 0.97 (kept from previous best)
dljr-github added a commit to dljr-github/parameter-golf that referenced this pull request on Apr 11, 2026
Decoded LZMA-compressed SOTA train_gpt.py. Replaced flash_attn_3_func with PyTorch SDPA (transpose to B,H,T,D format + enable_gqa). Full stack: 11L, 4xMLP, LeakyReLU², XSA, depth recurrence, parallel residuals, LN Scale, partial RoPE, EMA, GPTQ SDClip, TTT, brotli. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
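The flash-attn -> SDPA swap hinges on the layout transpose: flash_attn_3_func consumes (B, T, H, D) tensors, while PyTorch SDPA expects (B, H, T, D). A NumPy reference of the equivalent computation (causal masking and GQA omitted for brevity; the real replacement calls torch.nn.functional.scaled_dot_product_attention, this is just the math):

```python
import numpy as np

def sdpa_reference(q, k, v):
    """Scaled dot-product attention on (B, H, T, D) tensors."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)   # (B, H, T, T)
    scores -= scores.max(-1, keepdims=True)             # numerically stable softmax
    p = np.exp(scores)
    p /= p.sum(-1, keepdims=True)
    return p @ v                                        # (B, H, T, D)

# flash-attn layout (B, T, H, D) -> SDPA layout (B, H, T, D)
rng = np.random.default_rng(0)
q_flash = rng.standard_normal((2, 5, 4, 8))             # B=2, T=5, H=4, D=8
q_sdpa = q_flash.transpose(0, 2, 1, 3)
out = sdpa_reference(q_sdpa, q_sdpa, q_sdpa)
```

In the torch version, enable_gqa lets a smaller number of K/V heads be shared across query heads without manually repeating them; the output would be transposed back to (B, T, H, D) before the projection.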
translatingthename added a commit to translatingthename/parameter-golf that referenced this pull request on Apr 11, 2026
…(3-seed mean)

3-seed mean sliding val_bpb: 1.05869 (std 0.00038)
Seeds: 42 (1.05840), 1337 (1.05856), 2024 (1.05912)
All artifacts under 16,000,000 bytes. Zero pruning needed.

Key techniques:
- SP8192 tokenizer + GPTQ SDClip (int6 k=12.85, int8 embeddings k=20.0)
- 3-layer depth recurrence (L3-5, 14 virtual layers from 11 physical)
- Parallel residuals (L7+, GPT-J style)
- Pre-quant AdamW TTT (6 epochs, compiled for 2x speedup)
- QK-Gain 5.25, MuonEq-R, EMA 0.9965, warmdown 72%

Built on: PR openai#1394 @clarkkev, PR openai#1493 @bigbag, PR openai#1485 @ndokutovich

Summary
3-Seed Results
Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0337 BPB.
Key Techniques
Compliance (Track B)
Per Issue #1017:
Scoring under torch.no_grad() BEFORE the SGD update. No SLOT, no pre-quant TTT, no ETLB, no n-gram cache. All artifacts < 16MB, train < 600s, eval < 600s.
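The score-before-update ordering can be sketched on a toy linear model (NumPy; not the PR's actual code, which runs SGD with cosine decay over 3 epochs on the real network):

```python
import numpy as np

def score_first_ttt(w, chunks, lr=0.05, epochs=3):
    """Legal test-time training: each chunk is SCORED with the current
    weights FIRST, and only then used for gradient steps -- so no token's
    score ever benefits from gradients taken on that token."""
    total, n = 0.0, 0
    for X, y in chunks:
        total += float(((X @ w - y) ** 2).sum())        # score first (frozen w)
        n += len(y)
        for _ in range(epochs):                         # then adapt on the chunk
            w = w - lr * (2.0 / len(y)) * X.T @ (X @ w - y)
    return total / n, w

rng = np.random.default_rng(0)
chunks = [(rng.standard_normal((8, 3)), rng.standard_normal(8)) for _ in range(4)]
loss, w_adapted = score_first_ttt(np.zeros(3), chunks)
```

The key invariant is that the first chunk's contribution is computed under the unadapted weights; with w initialized to zero it is exactly the mean of y², which makes the ordering easy to verify in a test.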
Credits
PR #1394 @clarkkev, PR #1413 @dexhunter, PR #549 @abaybektursun, PR #1412 @Robby955, PR #1204 @msisovic, PR #1445 @X-Abhishek-X, PR #1331 @dexhunter
Acknowledgements
Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) — this was instrumental in running 160+ experiments that led to this result.
Reproduction
Test plan
🤖 Generated with Claude Code